Dataset Description
61,45,278 words | 4,31,27,842 characters | 6 Domains
The Manipuri Text Corpus is machine-readable and stored in a standard
format: the primary encoding is Unicode, and the data is stored as XML with
embedded metadata. The corpus was created by typing in contemporary texts.
The LDC-IL Manipuri Text Corpus contains 61,45,278 words drawn from
1202 different titles. The six major domains are Aesthetics, Commerce, Mass
Media, Official Documents, Science & Technology and Social Sciences.
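As a rough illustration of how a Unicode XML corpus file of this kind could be inspected, the Python sketch below parses an XML string and counts whitespace-separated tokens and their characters inside the text-bearing element. The element names (`doc`, `meta`, `title`, `body`) and the sample content are hypothetical placeholders, not the actual LDC-IL schema, and whitespace tokenization is an assumption that may not match how the published word and character counts were computed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real LDC-IL schema may use
# different element names and carry more metadata.
SAMPLE_XML = """<doc>
  <meta><title>Sample title</title></meta>
  <body>alpha beta gamma</body>
</doc>"""


def count_text(xml_string, text_tag="body"):
    """Return (word_count, char_count) for all <text_tag> elements.

    Words are whitespace-separated tokens (an assumption); characters
    are counted per token, so whitespace is excluded.
    """
    root = ET.fromstring(xml_string)
    words = 0
    chars = 0
    for elem in root.iter(text_tag):
        # itertext() also collects text from any nested child elements.
        tokens = "".join(elem.itertext()).split()
        words += len(tokens)
        chars += sum(len(token) for token in tokens)
    return words, chars


words, chars = count_text(SAMPLE_XML)
print(words, chars)
```

Metadata such as the title is skipped here because only the hypothetical `body` element is counted; pass a different `text_tag` to match the element that actually holds the running text.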
Details of the available text corpus:
Domains | Words | Percentage of Total Corpus |
Aesthetics | 37,72,994 | 61.40 % |
Commerce | 18,450 | 0.30 % |
Mass Media | 7,75,261 | 12.62 % |
Official | 4,42,950 | 7.21 % |
Science and Technology | 3,04,545 | 4.96 % |
Social Sciences | 8,31,078 | 13.52 % |
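The percentages in the table can be checked against the word counts with a short calculation. The Python sketch below copies the counts from the table in their Indian digit grouping, strips the commas, and recomputes each domain's share of the total; the function names are illustrative only.

```python
# Word counts copied from the table, written in Indian digit grouping.
DOMAIN_WORDS = {
    "Aesthetics": "37,72,994",
    "Commerce": "18,450",
    "Mass Media": "7,75,261",
    "Official": "4,42,950",
    "Science and Technology": "3,04,545",
    "Social Sciences": "8,31,078",
}


def parse_count(text):
    """Convert a comma-grouped number such as '37,72,994' to an int."""
    return int(text.replace(",", ""))


def domain_shares(table):
    """Return (total_words, {domain: percentage rounded to 2 places})."""
    counts = {name: parse_count(value) for name, value in table.items()}
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    return total, shares


total, shares = domain_shares(DOMAIN_WORDS)
print(total)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f} %")
```

The domain counts sum to 6145278, the word count stated for the corpus, and the recomputed shares agree with the percentage column to two decimal places.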
- Ramamoorthy, L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu, Longjam Anand Singh & M. Bidyarani Devi. 2019. A Gold Standard Manipuri Raw Text Corpus. Central Institute of Indian Languages, Mysore.
- Choudhary, Narayan & L. Ramamoorthy. 2019. "LDC-IL Raw Text Corpora: An Overview" in Linguistic Resources for AI/NLP in Indian Languages, Central Institute of Indian Languages, Mysore. pp. 1-10.
Item specifics
- Authors: Ramamoorthy L., Narayan Choudhary, Amom Nandaraj Meetei, Yumnam Premila Chanu, Longjam Anand Singh, Bidyarani Devi M
- Corpus Type: Raw Corpus
- Catalogue Number: 1146
- ISBN: 978-81-7343-245-3
- Data Source: Typed+Cleaned
- Character Count: 43127842
- Word Count: 6145278
- Release Date: 04-Apr-2019
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.